Introduction

This case study analyzes the data collected by a FitBit fitness tracker and draws predictions from it. The data can be fetched from here.

First Look

After unzipping the files, it's clear that they are not arranged in any meaningful way. Let's arrange the data according to the timeline it represents, i.e., daily, hourly, or by minute. The directory structure will then look similar to this:

.
├── daily
│   ├── dailyActivity_merged.csv
│   ├── dailyCalories_merged.csv
│   ├── dailyIntensities_merged.csv
│   ├── dailySteps_merged.csv
│   └── sleepDay_merged.csv
├── heartrate_seconds_merged.csv
├── hourly
│   ├── hourlyCalories_merged.csv
│   ├── hourlyIntensities_merged.csv
│   └── hourlySteps_merged.csv
├── minutes
│   ├── minuteCaloriesNarrow_merged.csv
│   ├── minuteCaloriesWide_merged.csv
│   ├── minuteIntensitiesNarrow_merged.csv
│   ├── minuteIntensitiesWide_merged.csv
│   ├── minuteMETsNarrow_merged.csv
│   ├── minuteSleep_merged.csv
│   ├── minuteStepsNarrow_merged.csv
│   └── minuteStepsWide_merged.csv
└── weightLogInfo_merged.csv

Before diving into the data, let's first install the required libraries.

install.packages("readr")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("hms")
install.packages("plotly")
install.packages("gridExtra")

Now let’s load them into memory.

library(readr)
library(dplyr)
library(ggplot2)
library(hms)
library(plotly)
library(gridExtra)

Now that we have our tools and are ready to dive in, we will start by importing all the files from the daily folder.

dailyActivity_merged <- read_csv("archive/Fitabase Data 4.12.16-5.12.16/daily/dailyActivity_merged.csv")
dailyCalories_merged <- read_csv("archive/Fitabase Data 4.12.16-5.12.16/daily/dailyCalories_merged.csv")
dailyIntensities_merged <- read_csv("archive/Fitabase Data 4.12.16-5.12.16/daily/dailyIntensities_merged.csv")
dailySteps_merged <- read_csv("archive/Fitabase Data 4.12.16-5.12.16/daily/dailySteps_merged.csv")

Verification and Integrity check

The purpose of this step is to make sure our data is in the right format and to check whether it has any NA values.

head(dailyActivity_merged)
## # A tibble: 6 × 15
##        Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie…
##     <dbl> <chr>             <dbl>         <dbl>           <dbl>            <dbl>
## 1  1.50e9 4/12/2016         13162          8.5             8.5                 0
## 2  1.50e9 4/13/2016         10735          6.97            6.97                0
## 3  1.50e9 4/14/2016         10460          6.74            6.74                0
## 4  1.50e9 4/15/2016          9762          6.28            6.28                0
## 5  1.50e9 4/16/2016         12669          8.16            8.16                0
## 6  1.50e9 4/17/2016          9705          6.48            6.48                0
## # … with 9 more variables: VeryActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## #   FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## #   SedentaryMinutes <dbl>, Calories <dbl>
head(dailyCalories_merged)
## # A tibble: 6 × 3
##           Id ActivityDay Calories
##        <dbl> <chr>          <dbl>
## 1 1503960366 4/12/2016       1985
## 2 1503960366 4/13/2016       1797
## 3 1503960366 4/14/2016       1776
## 4 1503960366 4/15/2016       1745
## 5 1503960366 4/16/2016       1863
## 6 1503960366 4/17/2016       1728
head(dailyIntensities_merged)
## # A tibble: 6 × 10
##           Id ActivityDay SedentaryMinutes LightlyActiveMinutes FairlyActiveMinu…
##        <dbl> <chr>                  <dbl>                <dbl>             <dbl>
## 1 1503960366 4/12/2016                728                  328                13
## 2 1503960366 4/13/2016                776                  217                19
## 3 1503960366 4/14/2016               1218                  181                11
## 4 1503960366 4/15/2016                726                  209                34
## 5 1503960366 4/16/2016                773                  221                10
## 6 1503960366 4/17/2016                539                  164                20
## # … with 5 more variables: VeryActiveMinutes <dbl>,
## #   SedentaryActiveDistance <dbl>, LightActiveDistance <dbl>,
## #   ModeratelyActiveDistance <dbl>, VeryActiveDistance <dbl>
head(dailySteps_merged)
## # A tibble: 6 × 3
##           Id ActivityDay StepTotal
##        <dbl> <chr>           <dbl>
## 1 1503960366 4/12/2016       13162
## 2 1503960366 4/13/2016       10735
## 3 1503960366 4/14/2016       10460
## 4 1503960366 4/15/2016        9762
## 5 1503960366 4/16/2016       12669
## 6 1503960366 4/17/2016        9705

After checking each column we can confirm that our data is in the right format and there are no discrepancies.

Now we will check whether there are any duplicate rows in the files, by defining a function that returns the number of duplicate rows.

count_duplicates <- function(dataframe){
  # total rows minus distinct rows = number of duplicate rows
  n <- nrow(dataframe) - nrow(unique(dataframe))
  return(n)
}

Now let’s call this function for every file.

count_duplicates(dailyActivity_merged)
## [1] 0
count_duplicates(dailyCalories_merged)
## [1] 0
count_duplicates(dailyIntensities_merged)
## [1] 0
count_duplicates(dailySteps_merged)
## [1] 0

All rows are unique, so we will now check for NA values in all the files.

dailyActivity_merged %>% is.na() %>% which()
## integer(0)
dailyCalories_merged %>% is.na() %>% which()
## integer(0)
dailyIntensities_merged %>% is.na() %>% which()
## integer(0)
dailySteps_merged %>% is.na() %>% which()
## integer(0)

There are no NA values in any of the files, so our cleaning process is done.

Studying the data

Before we dive into the Process phase of the analysis, it’s important to get familiar with the data first. Let’s check out the column names and try to find relations between the files.

We can use the colnames() function to see the column names of the files.

colnames(dailyActivity_merged)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
colnames(dailyCalories_merged)
## [1] "Id"          "ActivityDay" "Calories"
colnames(dailyIntensities_merged)
##  [1] "Id"                       "ActivityDay"             
##  [3] "SedentaryMinutes"         "LightlyActiveMinutes"    
##  [5] "FairlyActiveMinutes"      "VeryActiveMinutes"       
##  [7] "SedentaryActiveDistance"  "LightActiveDistance"     
##  [9] "ModeratelyActiveDistance" "VeryActiveDistance"
colnames(dailySteps_merged)
## [1] "Id"          "ActivityDay" "StepTotal"

Upon inspecting this data we find three things:

  1. They all have an Id column and a date column (ActivityDate/ActivityDay) in common.
  2. All of them have 940 rows.
  3. The dailyActivity_merged file contains the columns of all the other tables.

The third observation implies that we can get rid of all the files except dailyActivity_merged.

rm(dailyCalories_merged)
rm(dailyIntensities_merged)
rm(dailySteps_merged)

Processing

In this phase we will get rid of redundant elements and rename some columns.

  1. We are concerned with total active minutes, so we will introduce a new column active_minutes as the sum of VeryActiveMinutes, FairlyActiveMinutes, and LightlyActiveMinutes, and we will rename SedentaryMinutes to inactive_minutes.
  2. We will also convert ActivityDate to a date data type, as that is the appropriate type.
  3. We will introduce a new column called weekday, which will hold the day of the week for that date.
  4. We will also change some column names.

All these changes will be saved to a new dataframe called daily_activity.

options(width = 1500)
daily_activity <- dailyActivity_merged %>%
  transform(active_minutes = VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes,
            ActivityDate = as.Date(ActivityDate, format = "%m/%d/%Y"),
            weekday = weekdays(as.Date(ActivityDate, format = "%m/%d/%Y"))) %>%
  rename(inactive_minutes = SedentaryMinutes, date = ActivityDate,
         total_steps = TotalSteps, total_distance = TotalDistance) %>%
  select(Id, date, weekday, total_steps, total_distance, active_minutes, inactive_minutes, Calories)

colnames(daily_activity) <- tolower(colnames(daily_activity))

Our new data frame looks like this:

##           id       date   weekday total_steps total_distance active_minutes inactive_minutes calories
## 1 1503960366 2016-04-12   Tuesday       13162           8.50            366              728     1985
## 2 1503960366 2016-04-13 Wednesday       10735           6.97            257              776     1797
## 3 1503960366 2016-04-14  Thursday       10460           6.74            222             1218     1776
## 4 1503960366 2016-04-15    Friday        9762           6.28            272              726     1745
## 5 1503960366 2016-04-16  Saturday       12669           8.16            267              773     1863
## 6 1503960366 2016-04-17    Sunday        9705           6.48            222              539     1728

Analyzing & Visualizing relationships

Let’s create a scatter plot of total_steps against total_distance. We expect them to be directly proportional, since total_distance should increase linearly with total_steps.
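The plot code is not included in the text; a minimal ggplot2 sketch that would produce such a scatter plot from the daily_activity frame built above is:

```r
# Scatter plot of total steps against total distance; each point is
# one user-day from the daily_activity data frame built earlier
ggplot(daily_activity, aes(x = total_steps, y = total_distance)) +
  geom_point(alpha = 0.5, color = "steelblue") +
  labs(title = "Total Steps vs Total Distance",
       x = "Total steps", y = "Total distance (km)")
```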

As is clearly visible, our hypothesis was correct. This also means that total_distance is a redundant attribute and we can use total_steps alone. We can verify this correlation using cor() like this:

cor(daily_activity$total_steps, daily_activity$total_distance)
## [1] 0.9853688

The value of 0.98 signifies a strong relationship between total_steps and total_distance.

Now let’s plot total_steps, active_minutes, and inactive_minutes against calories. Our hypotheses are that:

  1. total_steps is directly proportional to calories
  2. active_minutes is directly proportional to calories
  3. inactive_minutes is inversely proportional to calories

To understand the relationship more easily let’s add a regression line to each plot.
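The plotting code is not shown in the text; a sketch along these lines, using grid.arrange() from gridExtra (loaded earlier) to place the three plots side by side, would produce the scatter plots with regression lines and emit the geom_smooth() messages below:

```r
# One scatter plot per attribute, each with a fitted linear regression line
p1 <- ggplot(daily_activity, aes(x = total_steps, y = calories)) +
  geom_point(alpha = 0.4) + geom_smooth(method = "lm")
p2 <- ggplot(daily_activity, aes(x = active_minutes, y = calories)) +
  geom_point(alpha = 0.4) + geom_smooth(method = "lm")
p3 <- ggplot(daily_activity, aes(x = inactive_minutes, y = calories)) +
  geom_point(alpha = 0.4) + geom_smooth(method = "lm")
grid.arrange(p1, p2, p3, ncol = 3)  # arrange the three plots in one row
```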

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Our hypothesis appears correct: the slopes for total_steps/calories and active_minutes/calories are positive, showing linear growth, while the slope for inactive_minutes/calories is negative. However, the relationships are not very strong, as there is a lot of variance in the data.

cor(daily_activity$total_steps, daily_activity$calories)
## [1] 0.5915681
cor(daily_activity$active_minutes, daily_activity$calories)
## [1] 0.4719975
cor(daily_activity$inactive_minutes, daily_activity$calories)
## [1] -0.106973

These values confirm that the relationships are not strong, so a linear model would not fit the data well.

Now that we have plotted and analyzed the relationships in the raw data, let’s compute the mean values of the attributes across the week and analyze them.

order_days <- function(data){
  # order the weekday factor from Monday through Sunday
  data$weekday <- factor(data$weekday,
                         levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                    "Friday", "Saturday", "Sunday"))
  return(data)
}

This function orders the days in the usual Monday-to-Sunday order.

mean_daily <- daily_activity %>% group_by(weekday) %>% summarize(mean_active = mean(active_minutes), mean_inactive = mean(inactive_minutes), mean_steps = mean(total_steps), mean_calories = mean(calories))
mean_daily <- order_days(mean_daily)
head(mean_daily, 7)
## # A tibble: 7 × 5
##   weekday   mean_active mean_inactive mean_steps mean_calories
##   <fct>           <dbl>         <dbl>      <dbl>         <dbl>
## 1 Friday           236.         1000.      7448.         2332.
## 2 Monday           229.         1028.      7781.         2324.
## 3 Saturday         244.          964.      8153.         2355.
## 4 Sunday           208.          990.      6933.         2263 
## 5 Thursday         217.          962.      7406.         2200.
## 6 Tuesday          235.         1007.      8125.         2356.
## 7 Wednesday        224.          989.      7559.         2303.

This is the summarized data; let’s visualize it now.
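The visualization code is not included in the text; bar charts of the weekly means could be sketched as follows, using the column names of the mean_daily table above:

```r
# Mean active minutes and mean steps per weekday as bar charts
p_active <- ggplot(mean_daily, aes(x = weekday, y = mean_active)) +
  geom_col(fill = "steelblue") +
  labs(title = "Mean active minutes per weekday", y = "Minutes")
p_steps <- ggplot(mean_daily, aes(x = weekday, y = mean_steps)) +
  geom_col(fill = "darkorange") +
  labs(title = "Mean steps per weekday", y = "Steps")
grid.arrange(p_active, p_steps, ncol = 2)  # place the two charts side by side
```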

By analyzing the graphs we find that people are:

  1. Most active on Saturday and Tuesday
  2. Least active around Thursday and Sunday

We will also analyze the sleep and heart-rate data, so let’s import the sleep data (sleepDay_merged) first.

sleepDay_merged <- read_csv("archive/Fitabase Data 4.12.16-5.12.16/daily/sleepDay_merged.csv")

We should first check for duplicate entries and, if there are any, remove them.

count_duplicates(sleepDay_merged)
## [1] 3
sleepDay_merged <- unique(sleepDay_merged)
count_duplicates(sleepDay_merged)
## [1] 0

Now that we have eliminated the duplicates, let’s check for NA values.

sleepDay_merged %>% is.na() %>% which()
## integer(0)
head(sleepDay_merged)
## # A tibble: 6 × 5
##           Id SleepDay              TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
##        <dbl> <chr>                             <dbl>              <dbl>          <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327            346
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384            407
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412            442
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340            367
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700            712
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304            320

There are no NA values, so we can move on to cleaning the data frame. As we can see above, the time component of SleepDay is constant, so we will only consider the date. We will convert SleepDay to a date format, calculate the weekday, and rename a few columns.

sleep_minutes <- sleepDay_merged %>%
  rename(id = Id, date = SleepDay, sleep_time = TotalMinutesAsleep,
         count = TotalSleepRecords, bed_time = TotalTimeInBed) %>%
  transform(date = as.Date(date, "%m/%d/%Y %I:%M:%S %p")) %>%
  transform(weekday = weekdays.Date(date)) %>%
  select(id, date, weekday, count, sleep_time, bed_time)

Let’s plot some graphs to analyze our data. We will plot the following relations:

  1. sleep_time vs bed_time
  2. Mean sleep_time per weekday
  3. Mean count per weekday

Let’s start with sleep_time vs bed_time.
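The scatter plot itself is not shown in the text; a minimal sketch, assuming the sleep_minutes frame built above:

```r
# Minutes asleep against minutes in bed; points close to the diagonal
# indicate nights where almost all time in bed was spent asleep
ggplot(sleep_minutes, aes(x = bed_time, y = sleep_time)) +
  geom_point(alpha = 0.5, color = "steelblue") +
  labs(x = "Time in bed (minutes)", y = "Time asleep (minutes)")
```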

cor(sleep_minutes$sleep_time, sleep_minutes$bed_time)
## [1] 0.9304224

As we can see, the data is highly linear and there is a strong linear relationship between these attributes; we will fit a regression model to this later on.

Now we will calculate a summary table to find means.

sleep_mean <- sleep_minutes %>% group_by(weekday) %>% summarize(mean_sleep_time = mean(sleep_time), mean_count = mean(count))
sleep_mean <- order_days(sleep_mean)
head(sleep_mean, 7)
## # A tibble: 7 × 3
##   weekday   mean_sleep_time mean_count
##   <fct>               <dbl>      <dbl>
## 1 Friday               405.       1.07
## 2 Monday               420.       1.11
## 3 Saturday             419.       1.19
## 4 Sunday               453.       1.18
## 5 Thursday             401.       1.03
## 6 Tuesday              405.       1.11
## 7 Wednesday            435.       1.15

Let’s plot mean_sleep_time per weekday.
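The bar chart code is not included in the text; a sketch using the sleep_mean table above:

```r
# Mean minutes asleep for each day of the week
ggplot(sleep_mean, aes(x = weekday, y = mean_sleep_time)) +
  geom_col(fill = "steelblue") +
  labs(title = "Mean sleep time per weekday",
       x = "Weekday", y = "Minutes asleep")
```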

We can conclude that people sleep:

  1. Most on Wednesday and the weekend (Saturday and Sunday)
  2. Least on Thursday and Tuesday

We don’t really need to plot mean_count; we can simply sort the table in descending order of mean_count.

head(sleep_mean %>% select(weekday, mean_count) %>% arrange(desc(mean_count)), 7)
## # A tibble: 7 × 2
##   weekday   mean_count
##   <fct>          <dbl>
## 1 Saturday        1.19
## 2 Sunday          1.18
## 3 Wednesday       1.15
## 4 Monday          1.11
## 5 Tuesday         1.11
## 6 Friday          1.07
## 7 Thursday        1.03

We can see that people take:

  1. The most naps on Saturday, Sunday, and Wednesday
  2. The fewest naps on Thursday, Friday, and Tuesday

This matches the conclusion we drew from the previous plot.